Abstract: We present a technique for the animation of a 3D kinematic tongue model, one component
of the talking head of an acoustic-visual (AV) speech synthesizer. The skeletal animation approach
is adapted to make use of a deformable rig controlled by tongue motion capture data obtained with
electromagnetic articulography (EMA), while the tongue surface is extracted from volumetric
magnetic resonance imaging (MRI) data. Initial results are shown and future work outlined.

1 Introduction 

As part of ongoing research in developing a fully data-driven acoustic-visual (AV) text-to-speech
(TTS) synthesizer [16], we integrate a tongue model to increase visual intelligibility
and naturalness. To extend the kinematic paradigm used for facial animation in the synthesizer 
to tongue animation, we adapt state-of-the-art techniques of animation with motion-capture data 
for use with electromagnetic articulography (EMA). 

Our AV synthesizer 1 is based on a non-uniform unit-selection TTS system for French [4],
concatenating bimodal units of acoustic and visual data, and extending the selection algorithm
with visual target and join costs [13]. The result is an application whose graphical user interface
(GUI) features a "talking head" (i.e. computer-generated face), which is animated synchronously
with the synthesized acoustic output.

This synthesizer depends on a speech corpus acquired by tracking marker points painted onto 
the face of a human speaker, using a stereoscopic high-speed camera array, with simultaneously 
recorded audio. While the acoustic data is used for waveform concatenation in a conventional
unit-selection paradigm, the visual data is post-processed to obtain a dense, animated 3D point
cloud representing the speaker's face. The points are interpreted as the vertices of a mesh, which 
is then rendered as an animated surface to generate the face of the talking head using a standard 
vertex animation paradigm. 

Due to the nature of the acquisition setup, no intra-oral articulatory motion data can be
simultaneously captured. At the very least, any invasive instrumentation, such as EMA wires or
transducer coils, would have a negative effect on the speaker's articulation and hence, the quality
of the recorded audio; additional practical issues (e.g. coil detachment) would limit the length 
of the recording session, and by extension, the size of the speech corpus. As a consequence, 
the synthesizer's talking head currently features neither tongue nor teeth, which significantly 
decreases both the naturalness of its appearance and its visual intelligibility. 

To address this shortcoming, we develop an independently animated 3D tongue and teeth model,
which will be integrated into the talking head and eventually controlled by interfacing directly
with the TTS synthesizer.

1 http://visac.loria.fr/

2 EMA-based tongue model animation

To maintain the data-driven paradigm of the AV synthesizer, the tongue model 2 consists of a 
geometric mesh rendered in the GUI along with (or rather, "behind") the face. Since the primary 
purpose of the tongue model is to improve the visual aspects of the synthesizer and it has no 
influence on the acoustics, there is no requirement for a complex tongue model to calculate the 
vocal tract transfer function, etc. Therefore, in contrast to previous work [e.g. 5, 6, 8, 10, 14, 17], 
most of which attempts to predict tongue shape and/or motion by simulating the dynamics in one 
form or another, we must merely generate realistic tongue kinematics, without having to model 
the anatomical structure of the human tongue or satisfy physical or biomechanical constraints. 

This scenario allows us to make use of standard animation techniques using motion capture 
data. Specifically, we apply electromagnetic articulography (EMA) using a Carstens AG500 3 to 
obtain high-speed (200 Hz), 3D motion capture data of the tongue during speech [7]. 

While other modalities might be used to acquire the shape of the tongue while speaking, their
respective drawbacks make them ill-suited to our needs. For example, ultrasound tongue imaging
tends to require extensive processing to track the mid-sagittal tongue contour and does not
usually capture the tongue tip, while real-time magnetic resonance imaging (MRI) has a very
low temporal resolution, and is currently possible only in a single slice. 4

2.1 Tongue motion capture 

Conventional motion capture modalities (as widely used e.g. in the animation industry) normally
employ a camera array to track optical markers attached to the face or body of a human
actor, producing data in the form of a 3D point cloud sampled over time. For facial animation, 
these points (given sufficient density) can be directly used as vertices of a mesh representing 
the surface of the face; this is the vertex animation approach taken in the AV synthesizer (see 
above). 

For articulated body animation, however, the 3D points are normally used as transformation 
targets for the rigid bones of a hierarchically structured (usually humanoid) skeleton model. 
Much like the strings controlling a marionette, the skeletal transformations are then applied to a 
virtual character by deforming its geometric mesh accordingly, a widely used technique known 
as skeletal animation. 

Since current EMA technology allows the tracking of no more than 12 transducer coils (usually 
significantly fewer on the tongue), the resulting data is too sparse for vertex animation of the 
tongue surface. For this reason, we adopt a skeletal animation approach, but without enforcing 
a rigid structure, since the human tongue contains no bones and is extremely deformable. This 
issue is addressed below. 

2 For reasons of brevity, in the remainder of this paper, we will refer only to a tongue model, but it should be
noted that such a model can easily encompass upper and lower teeth in addition to the tongue.
3 Carstens Medizinelektronik GmbH, http://www.articulograph.de/
4 3D cine-MRI of the vocal tract [15], while possible, is far from realistic for the compilation of a full speech
corpus sufficient for TTS.

Figure 1 - Coil layout for EMA test corpus

Figure 2 - Perspective view of EMA coils rendered as primitive cones to visualize their orientation

One advantage of EMA lies in the fact that the data produced is a set of 3D vectors, not points,
as the AG500 tracks the orientation, as well as position, of each transducer coil. Thus, the
rotational information supplements, and compensates to some degree for the sparseness of,
the positional data. Technically, this corresponds to motion capture approaches such as [11],
although the geometry is of course quite different for the tongue than for a humanoid skeleton.
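
To illustrate how the orientation data described above can be turned into a direction vector (for instance, to orient a cone primitive), the following minimal sketch converts two angles into a unit vector. It assumes that each coil's orientation is given as an azimuth and an elevation angle in degrees, and the function name and the axis convention are our own illustrative choices rather than a description of the AG500 output format.

```python
import math


def coil_direction(azimuth_deg, elevation_deg):
    """Return a unit vector for a coil axis.

    Assumed convention (hypothetical): azimuth is measured in the x-y plane
    from the x axis, elevation is measured from the x-y plane towards z.
    """
    phi = math.radians(azimuth_deg)
    theta = math.radians(elevation_deg)
    return (
        math.cos(theta) * math.cos(phi),
        math.cos(theta) * math.sin(phi),
        math.sin(theta),
    )
```

Together with the coil position, such a vector fixes the coil's axis in 3D space, although the rotation about that axis remains undetermined.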

As a small EMA test corpus, we recorded one speaker using the AG500, with the following 
measurement coil layout: tongue tip center, tongue blade left/right, tongue mid center/left/right, 
tongue back center, lower incisor, upper lip (reference coils on bridge of nose and behind each 
ear). The exact arrangement can be seen in Figure 1. The speech material comprises sustained
vowels in the set [i, y, u, e, ø, o, a, a], repetitive CV syllables permuting these vowels with the
consonants in the set [p, t, k, m, n, ŋ, f, θ, s, ʃ, ç, x, l, l], as well as 10 normal sentences each in
German and English. A 3D palate trace was also obtained.

We imported the raw EMA data as keyframes into the animation component of a fully-featured, 
open-source, 3D modeling and animation software suite, 5 using a custom plugin. Unlike point 
cloud based motion capture data contained in industry standard formats such as C3D [12], this 
allows us to directly import the rotational data as well. As an example of the result, one frame 
is displayed in Figure 2. Within each frame of the animation, the EMA coil objects can provide 
the transformation targets for an arbitrary skeleton. 
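
As a rough illustration of the keyframe import, the sketch below places a sequence of EMA samples on an animation timeline. It is not the custom plugin itself: the `Keyframe` type, the function name, and the default frame rate are hypothetical, and only the 200 Hz sampling rate is taken from the text.

```python
from typing import List, NamedTuple, Sequence, Tuple

Vec3 = Tuple[float, float, float]


class Keyframe(NamedTuple):
    frame: float    # position on the animation timeline
    location: Vec3  # coil position at that frame


def samples_to_keyframes(
    samples: Sequence[Vec3],
    sample_rate_hz: float = 200.0,
    fps: float = 200.0,
    start_frame: float = 1.0,
) -> List[Keyframe]:
    """Place one keyframe per EMA sample on a timeline running at `fps`."""
    step = fps / sample_rate_hz  # timeline frames advanced per EMA sample
    return [
        Keyframe(start_frame + i * step, location)
        for i, location in enumerate(samples)
    ]
```

With `fps` equal to the sampling rate, every sample falls on an integer frame; a lower frame rate yields fractional frame positions, which an animation system would have to interpolate or resample.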

Once the motion capture data has been imported into the 3D software, it can be segmented into 
distinct actions for use and re-use in non-linear animation (NLA). This allows us to manipulate
and concatenate any number of frame sequences as atomic actions, and to synthesize new
animations from them, using e.g. the 3D software's NLA editor (which, for these purposes, is 
conceptually similar to a gestural score in articulatory phonology [3]). 
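
The idea of re-using segmented actions can be sketched independently of any particular 3D software. In the following minimal example, an action is simply a list of per-frame poses, and a new animation is assembled by laying out named actions end to end. The function name and data layout are hypothetical, and no blending is performed at the joins, which a real NLA editor could additionally provide.

```python
def concatenate_actions(actions, order):
    """Lay out named actions end to end on a single timeline.

    `actions` maps an action name to its list of per-frame poses; `order`
    lists the action names in playback order (names may repeat).
    Returns the strips as (name, first_frame, last_frame) and the joined poses.
    """
    strips = []
    timeline = []
    for name in order:
        frames = actions[name]
        start = len(timeline)
        timeline.extend(frames)
        strips.append((name, start, len(timeline) - 1))
    return strips, timeline
```

For example, with an action "a" of three frames and an action "t" of two frames, the order ["a", "t", "a"] produces three strips covering frames 0 to 2, 3 to 4, and 5 to 7.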

2.2 Tongue model animation 

In order to use the tongue motion capture data to control a tongue model using skeletal animation,
we design a simple skeleton as a rig for the tongue mesh. This rig consists of a central
"spine", and two branches to allow (potentially asymmetric) lateral movement, such as grooving.
Once again, it must be pointed out that this rig is unrelated to real tongue anatomy, although
it could be argued that e.g. the spine corresponds roughly to the superior and inferior longitudinal
muscles.

5 Blender v2.5, http://www.blender.org/

Figure 3 - Tongue model in initial and final frame of one [at] cycle in the EMA test corpus
(a) Perspective view of EMA coils (rendered as cones) and deformable skeletal rig in bind pose ([a] vowel). Tongue tip at center, oriented towards left; upper lip and lower incisor coils are visible further left. Adaptation struts (IK targets, cf. text) are shown as thin rods connecting coils and B-bones
(b) Like 3a, but deformed according to [t] target pose
(c) Like 3a, but with the tongue mesh bound to the rig. Its surface is color-shaded with a heat map visualizing the influence of the tongue-tip B-bone on the mesh vertices (red=full; blue=none)
(d) Like 3c, but deformed according to [t] target pose
(e) Like 3c, but showing only the tongue surface mesh, with the tip at center, oriented left
(f) Like 3e, but deformed according to [t] target pose

Of course, a skeleton of rigid bones is inadequate to mimic the flexibility of a real tongue. Our
solution is to construct the rig using deformable bones, so-called Bézier bones (B-bones), which
can bend, twist, and stretch as required, governed by a set of constraining parameters.

The tongue model should be able to move independently of any specific EMA coil layout, since 
after all, the motion capture data represents observations of tongue movements based on hidden 
dynamics. For this reason, and to maintain as much modularity and flexibility as possible in the 
design, the animation rig is not directly connected to the EMA coils in the motion capture data. 
Instead, we introduce an adaptation layer in the form of "struts", each of which is connected to 
one coil, while the other end serves as a target for the rig's B-bones. These struts can be adapted 
to any given EMA coil layout or rig structure. 

With the struts in place and constrained to the movements of the EMA coils, the rig can be 
animated by using inverse kinematics (IK) to determine the location, rotation, and deformation 
of each B-bone for any given frame. The IK are augmented by volume constraints, which inhibit 
potential "bloating" of the rig during B-bone stretching. 
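
One common way to express such a volume constraint is to shrink a bone's cross-section as it is stretched along its axis. The sketch below shows this relation under the simplifying assumption that a bone behaves like a cylinder of constant volume; it is an illustration of the principle with a hypothetical function name, not a description of how the 3D software implements its constraint.

```python
import math


def volume_preserving_scale(rest_length, stretched_length):
    """Return (axial_scale, radial_scale) that keep a cylinder's volume constant.

    Volume is proportional to length times radius squared, so stretching the
    axis by a factor s requires scaling the radius by 1 / sqrt(s).
    """
    if rest_length <= 0 or stretched_length <= 0:
        raise ValueError("lengths must be positive")
    axial = stretched_length / rest_length
    radial = 1.0 / math.sqrt(axial)
    return axial, radial
```

A bone stretched to four times its rest length would thus be narrowed to half its rest radius, while a compressed bone would thicken correspondingly.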

The final component for tongue model animation is a mesh that represents the tongue surface, 
which is rendered in the GUI and deformed according to the skeletal animation. While this 
tongue mesh could be an arbitrary geometric structure, we use an isosurface extracted from
a volumetric scan in an MRI speech corpus (from a different speaker; voxel size 1.09 mm × 1.09 mm ×
4 mm). The tongue in this scan was manually segmented using a graphics tablet and
open-source medical imaging software. 6

The resulting tongue mesh was manually registered to the EMA coil positions in a neutral 
bind pose. The skeletal rig was then embedded, and vertex groups in the tongue mesh assigned 
automatically to each B-bone. As the motion capture data animates the rig using IK, the tongue 
mesh is deformed accordingly, approximately following the EMA coils. Figure 3 displays the 
initial and final frame in one cycle of repetitive [ta] articulation in the EMA test corpus. In an 
informal evaluation, our technique appears to produce satisfactory results, and encourages us to 
pursue and refine this approach to tongue model animation. 
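
The role of the vertex groups can be illustrated with linear blend skinning, a widely used scheme in which each vertex follows a weighted average of the transformations of the bones that influence it. The sketch below assumes this scheme for illustration only; the function name is hypothetical, bone transformations are passed in as plain functions, and the actual deformation applied by the 3D software to B-bones is more elaborate.

```python
def skin_vertex(vertex, weights, transforms):
    """Deform one vertex by a weighted blend of per-bone transformations.

    `weights` maps bone names to non-negative influence weights (cf. the heat
    map in Figure 3c); `transforms` maps bone names to functions that move a
    3D point from the bind pose to the current pose.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("vertex must be influenced by at least one bone")
    blended = [0.0, 0.0, 0.0]
    for bone, weight in weights.items():
        moved = transforms[bone](vertex)
        for axis in range(3):
            blended[axis] += (weight / total) * moved[axis]
    return tuple(blended)
```

A vertex fully weighted to the tongue-tip bone follows that bone exactly, whereas a vertex with equal weights for two bones ends up halfway between the two transformed positions.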

3 Discussion and Outlook 

We have presented a technique to animate a kinematic tongue model, based on volumetric vocal
tract MRI data, using skeletal animation with a flexible rig, controlled by motion capture
data acquired with EMA, and implemented with off-the-shelf, open-source software. While this 
approach appears promising, it is still under development, and there are various issues which 
must be addressed before the tongue model can be integrated into our AV TTS synthesizer as 
intended. 

• Upper and lower teeth can be added to the model using the same data and animation 
technique, albeit with a conventional, rigid skeleton. These will then be rendered in the 
synthesizer's GUI along with the face and tongue. 

• The tongue mesh used here is quite rough, and registration with the EMA data does not
produce the best fit, owing to differences between the speakers' vocal tract geometries and
articulatory target positions, quite possibly exacerbated by the effects of supine posture
during MRI scanning [e.g. 9]. A more suitable mesh might be obtained by scanning the
tongue of the same speaker used for the EMA motion capture data, at a higher resolution.

6 OsiriX v3.9, http://www.osirix-viewer.com/

• Registration of the tongue mesh into the 3D space of the tongue model should be performed
in a partially or fully automatic way, using landmarks available in both MRI and
EMA modalities [cf. 1], such as the 3D palate trace and/or high-contrast markers at the 
positions of the reference coils. 

• The reliability of EMA position and orientation data is sometimes unpredictable. This 
could be due to the algorithms used to process the raw amplitude data, faulty hardware, 
interference (even within the coil layout itself), or any combination of such factors. However,
since any such errors are immediately visible in the animation of the tongue model
by introducing implausible deformations, we are working on methods both to clean the 
EMA data itself, and to make the tongue model less susceptible to such outlier trajectory 
segments. 

• To evaluate the performance of the animation technique, factors such as skin deformation
and distance of EMA coils from the tongue model surface should be monitored. The
structure of the skeletal rig can be independently refined, optimizing its ability to generate
realistic tongue poses. Its embedding into the tongue mesh should preferably be
performed using a robust automatic method [e.g. 2]. 

• The 3D palate trace can be used to add a palate surface mesh to the tongue model. For 
both the palate and the teeth, the model could also be augmented with automatic collision 
detection by accessing the 3D software's integrated physics engine. 7 
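
One of the evaluation measures mentioned in the list above, the distance of the EMA coils from the tongue model surface, could be monitored along the following lines. This is only a sketch under a simplifying assumption: the surface is approximated by its mesh vertices, so the result is an upper bound on the true point-to-surface distance. The function names are hypothetical.

```python
import math


def nearest_vertex_distance(coil, vertices):
    """Distance from one coil position to the closest mesh vertex."""
    return min(math.dist(coil, vertex) for vertex in vertices)


def mean_coil_distance(coils, vertices):
    """Average nearest-vertex distance over all coils in one frame."""
    return sum(nearest_vertex_distance(c, vertices) for c in coils) / len(coils)
```

Tracking such a value over all frames of an animation would indicate where the deformed mesh drifts away from the measured coil positions.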

For an interactive application such as the AV synthesizer GUI, it is impractical to incur the
performance overhead of an elaborate 3D rendering engine, especially when a non-trivial processing
load is required for the bimodal unit-selection. Instead, we anticipate integrating the tongue
model into the talking head using a more lightweight, real-time capable 3D game engine, which
may even offload the visual computation to dedicated graphics hardware. The advantage of using
keyframe-based, NLA actions is that they can be ported into such engines as animated 3D
models, using common interchange formats. 8 Although the skeletal rig could be accessed and 
manipulated directly, this "pre-packaging" of animation actions also avoids the complexity, or 
perhaps even unavailability, of advanced features such as B-bones or IK in those engines. 

The final integration challenge is to interface the tongue model directly with the TTS system
to synthesize the correct animation actions with appropriate timings. This task might be
accomplished using a diphone-synthesis-style approach, or even action unit-selection, and will be
addressed in the near future.